Template Extraction from Heterogeneous Web Pages with Cosine Similarity

Similar resources

A Methodology for Template Extraction from Heterogeneous Web Pages

The World Wide Web is a vast and highly useful collection of information. To achieve high productivity in publishing, web pages are automatically evaluated using common templates with contents. Templates are considered harmful because they compromise the relevance judgement of many web information retrieval and web mining methods, such as clustering and classification, and badly impact the p...

Effective and Enhanced method for Template Extraction from Heterogeneous Web Pages

To achieve high productivity in publishing, web pages are automatically evaluated using common templates with contents. The templates give readers easy access to the contents, guided by consistent structures. Web documents are clustered based on the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. This process propo...

Unsupervised Structured Data Extraction from Template-generated Web Pages

This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a very wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This encourages industry and academia to seek automatic solutions. To tackle t...

Automatic Data Extraction from Template Generated Web Pages

Information retrieval calls for accurate web page data extraction. To enhance retrieval precision, irrelevant data such as navigation bars and advertisements should be identified and removed prior to indexing. We propose a novel approach that identifies web page templates and extracts the unstructured data. Our experimental results on several different web sites demonstrate the feasibility ...

A Similarity Reinforcement Algorithm for Heterogeneous Web Pages

Many machine learning and data mining algorithms rely crucially on similarity metrics. However, most early research, such as the Vector Space Model or Latent Semantic Indexing, used only a single relationship to measure the similarity of data objects. In this paper, we first use an Intra- and Inter-Type Relationship Matrix (IITRM) to represent a set of heterogeneous data objects and their inter-rel...

Journal

Journal title: International Journal of Computer Applications

Year: 2014

ISSN: 0975-8887

DOI: 10.5120/15186-3546